Tomasz Przechlewski
March, 2019
My name is Tomasz Plata-Przechlewski and I live in Poland. I was born on 16th June 1963 (a Sunday, and the exact day when Valentina Vladimirovna Tereshkova was launched into space – if you know who she is).
BTW in Poland being born on a Sunday means being a work-shy (i.e. lazy) person (so now you know your first Polish proverb)
BTW by pure statistics \(1/7 \approx 14\)% of the population is work-shy:-)
I graduated in economics a long time ago and taught (mainly) statistics and information systems. I am a big fan of open source software (OSS) and I know a few OSS systems, including Linux and LaTeX. And of course R, which I am about to show you in a while.
My hobbies are road cycling and history. I am also an amateur photographer (cf tprzechlewski@flickr)
Statistics (nothing spectacular, just classical EDA, no (heavy) math, relax)
Statistical software (modern, non-standard or hipster #youcall)
Poland (via statistical examples)
Statistics (particularly in the social domain) is the most dangerous form of a lie. Why? Because of fanciful definitions, sloppy measures, poor samples, and erroneous computations.
Students are unaware of this pitiful state of affairs.
Three components of Statistics (statistical value chain 1-st version):
Theory (models) + Tools (programs) + Practice (real data)
Undergraduate courses on statistics concentrate on theory and use a spreadsheet as a universal computing tool.
For students statistics is a lot of math formulas + Excel = difficult and boring
Students work with artificial (clean) and small data sets, and thus are unaware of the problems related to applying theory to practice and/or of the data definition/collection stage.
A change of concept is urgently needed. Students should be aware of the true workflow of statistical analysis:
Office software has limits. Spreadsheets are good for number crunching, but are not so good at: data cleaning (Practice), advanced graphics, spatial analysis (Practice/Theory), team work (Practice).
Office editors or PowerPoint are great tools but are not suited for quality publishing of statistical results.
It is wrong to ignore the existence of modern open source tools and not introduce students to them. It is wrong not to introduce students to some (even elementary) programming, sticking exclusively to the point-and-click mode of work (i.e. a spreadsheet).
I will try to demonstrate that using modern tools for statistical analysis is a feasible way to go, and that (some) modern tools are not much more difficult (let alone prohibitively difficult) than office software (at least above the basic level of usage).
Conclusion: less theory, more practice and common sense. Show students the real ‘value chain’ of statistical analysis with all its problems (not covered nowadays):
Poor definitions: the complexity of statistical data collection; imprecise, complex, unintuitive, contradictory etc. definitions; imprecise (or worse, meaningless) measurement; unreliable (incomplete, erroneous) data etc…
Modern tools: programmable, better quality (graphics), more reliable
New workflow based on the reproducible research concept (explained momentarily)
Number of students.
Who is a student?
A student is a person attending a 3rd-level school in the 3-stage education system (cf Educational_stage). The answer is still non-obvious, as there are many forms of tertiary education. For example:
The UNESCO stated that tertiary education focuses on learning endeavors in specialized fields. It includes academic and higher vocational education.
So according to the above definition a school does not belong to tertiary education if its status is neither academic nor higher vocational. Examples: a Dance Academy or a University for the Elderly (aka University of the 3rd Age). Both are popular in Poland.
In many countries there is some certification scheme. For example, in Poland a school must apply for (and get) a certificate to be regarded as a higher school (i.e. part of the tertiary level of education)
Heads vs Majors
A student can be enrolled in more than one course (major). So for counting heads it is necessary to remove duplicates, otherwise one would count majors, not persons.
Part time studies
FTE stands for Full-Time Equivalent, an approximation of the number of students who would be enrolled full-time.
Full-time equivalent (FTE) is based on student credit hours. It is obtained by dividing student credit hours by the number of credit hours required for full-time study.
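For illustration, a minimal sketch of the FTE arithmetic in R (the numbers and the 30-hour full-time load are made up):
credit_hours <- c(30, 15, 30, 10, 20)   # hypothetical credit hours of five enrolled students
full_time_load <- 30                    # assumed credit hours required for full-time study
sum(credit_hours) / full_time_load      # 3.5 FTE, although 5 heads (persons) are enrolled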
Conclusion: Majors, Persons or FTEs? Which is the best?
University of Utah/Office of Analysis, Assessment and Accreditation google:single multiple majors fte
Who is a tourist? According to Glossary:Tourism:
Tourism means the activity of visitors taking a trip to a main destination outside their usual environment, for less than a year, for any main purpose, including business, leisure or other personal purpose, other than to be employed by a resident entity in the place visited.
According to the above definition, to be regarded as a tourist one has to change her/his place of accommodation for less than one year (otherwise Eurostat would regard her/him as a migrant)
The usual meaning (at least in Poland) is that a tourist travels for leisure, not for work. People travelling to work have other needs/aims than those travelling to rest (for example they usually do not use hotels), so the above definition solves some problems but at the same time creates many others.
Number of tourists: does not distinguish between various forms of tourists; difficult to collect (who is a tourist anyway?)
Various ‘numbers of’ tourist-oriented establishments (hotels, catering units, beds, nights spent) etc. They do not measure tourists per se but are highly related and more reliable (as they are easier to count).
Indicator of tourist activity (by various tourist types).
Conclusion: the measurement of tourism activity is not trivial. Other similar cases: internet user, migrant, unemployed person, illiterate person
NACE = Statistical Classification of Economic Activities = the industry standard classification system used in the European Union. NACE uses four hierarchical levels: Section, Division, Group and Class, where the Section is denoted by a single letter. Examples:
A01.44: A = Agriculture, forestry and fishing (section); 01 = Crop and animal production, hunting and related service activities (division); 01.4 = Animal production (group); 01.44 = Raising of camels and camelids (class)
I55.1 Accommodation and food service activities (I); Accommodation (55); Hotels and similar accommodation (55.1)
Two sides of tourism: supply side (hotels) / demand side (tourists)
BTW: demand = how much a product/service is desired; supply = how much the market can offer
Tourism supply statistics (accommodation statistics): data on rented accommodation, i.e. the capacity and occupancy of tourist accommodation establishments in the reporting country. How is it collected? Registers?
How is statistical data collected? Exhaustive data vs samples. Exhaustive: dedicated surveys (obligatory reports) vs administrative registers (births, deaths, police statistics). Sample: representative sample vs random sample vs panel data. Panels (cf Panel Research) are overused nowadays (cf https://panelariadna.pl/):
Quirks of data collection: data up to and including 2015 refer only to those units that submitted the statistical reports. Starting with data for January 2016, a data imputation method was implemented, i.e. missing data are replaced with some (possibly meaningful :-)) values (cf BDL)
Impu-what? Missing data problem
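For the curious, a minimal sketch of (naive) mean imputation in R; the vector below is made up:
x <- c(12, NA, 7, 15, NA, 9)                        # reported values with two missing ones
x.imputed <- ifelse(is.na(x), mean(x, na.rm=T), x)  # replace NAs with the mean of observed values
x.imputed                                           # 12.00 10.75 7.00 15.00 10.75 9.00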
Tourism demand statistics: data on the participation in tourism of the residents of the reporting country. How is it collected? Surveys?
Most of the time, data on domestic and outbound trips (where “outbound tourism” means residents of a country travelling in another country) is collected via sample surveys (cf Annual data on trips of EU residents and Tourism_statistics_-_top_destinations)
Regulations concerning data collection in tourism (hundreds of pages): Glossary:Supply_side_tourism_statistics and EU regulation No 692/2011
So now we know what we are dealing with…
No doubt in every reliable survey the population has to be precisely defined ie 3 dimensions of every surveyed unit should be fixed: definition (what), time (when measured), space (where)…
I always repeat to my students: if you look at some data (in the media for example), start by establishing whether you know what, when and where. If no information (or a reliable link – called a source – to information) is provided on any of the fixed dimensions of the data, treat the data as rubbish and do not waste time using/analysing it.
Further dissemination of such defective data should be publicly prosecuted (joke)
I have already tried to show you that the what dimension is complicated and often highly unreliable/arbitrary (due to the nature of the phenomenon and/or measurement difficulties).
The when dimension is much simpler thanks to a universal standard, i.e. time. You gather data either for a certain moment (how many hotels were in use on 31st December 2018) or for a certain period of time (how many beds were sold in these hotels in the 3rd quarter of 2018).
The where dimension in turn is usually based on administrative or statistical (geographical) units (country, state/province, county, community). But contrary to the time dimension, there is no universal or globally accepted standard for geostatistical units. Usually such a standard is based on the administrative system, which is country-dependent.
The administrative division of Poland since 1999 has been based on three levels of subdivision (cf Administrative divisions of Poland). In 2004 Poland became a member of the European Union, so EU regulations are part of the national law system.
EU regulates everything, statistics included.
Conclusion: The pigs had to expend enormous labours every day upon mysterious things called “files,” “reports,” “minutes,” and “memoranda.” These were large sheets of paper which had to be closely covered with writing, and as soon as they were so covered, they were burnt in the furnace (George Orwell, Animal Farm)
The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)
The NUTS standard has been revised several times (on average every 4 years :-)), so there is even a page in the ec.europa.eu domain dedicated to the NUTS (short) history (cf NUTS history)
NUTS1 (level) – macroregion, NUTS2 – state, NUTS3 – subregion (several counties in case of Poland)
Poland is divided into 7 macroregions, 16 states (NUTS2), and 72 subregions (NUTS3).
The NUTS1 level exists only for statistical purposes (but the regions are in fact distinct due to history, economics, natural conditions, cultural factors etc.)
There is a relevant and interesting page by GUS (the Main Statistical Office, Główny Urząd Statystyczny), but unfortunately in Polish (use Google Translate :-) in case you are interested, or mail me) (cf Klasyfikacja NUTS w Polsce)
The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW, province in Polish is “prowincja” (both words come from Latin), but a Polish administrative province is actually called “województwo”, from “wodzić”, i.e. commanding (armed troops in this context). This is an old term/custom from the 14th century, when Poland was divided into provinces (every province ruled by a “wojewoda”, i.e. the chief of that province). More can be found at Wikipedia (cf Administrative divisions of Poland)
NUTS3 consists of 380 counties grouped into 72 subregions.
A Polish county (called “powiat”) is a 2nd-level administrative unit.
In ancient Poland a powiat was called “starostwo” and the head of a “starostwo” was called a “starosta”. “Stary” means old, so a “starosta” is an old (and thus wise) person. BTW the head of a powiat is still called “starosta”, just as 600 years ago :-)
The 3rd level administrative unit is called “gmina” (community).
There are (approximately) 380 counties and 2750 communities in Poland.
As Poland's population is 38.5 mln and its area equals 312.7 thousand sq kilometers (about 120 persons per sq km), on the average each powiat covers about 820 sq km and each community about 113.5 sq km, i.e. approximately 100 thousand persons per powiat and 14 thousand per gmina.
TERYT is the Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes the identification of administrative units. Every unit has an (up to) 7-digit id number wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and t encodes the type of community (rural, municipal or mixed). Higher units have trailing zeros for the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a community (approx. 2750 units).
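A minimal sketch of decoding a TERYT id in R with substr (the id below is made up for illustration):
id <- "2261011"            # hypothetical id: ww=22, pp=61, gg=01, t=1
ww <- substr(id, 1, 2)     # "województwo" part
pp <- substr(id, 3, 4)     # "powiat" part
gg <- substr(id, 5, 6)     # "gmina" part
tt <- substr(id, 7, 7)     # type of community
c(ww, pp, gg, tt)          # "22" "61" "01" "1"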
So you are now experts on administrative division of Poland, and we can go back to statistical charts…
Indicators can be divided into hard indicators and soft indicators. Hard indicators denote hard facts while soft indicators denote beliefs and intentions. For example the number of hotels is a fact, while the intention to stay abroad for less than a year is not a fact but an intention. In Poland at least 80% of respondents declare they intend to vote, while the true turnout never exceeds 55%. In other words, measuring something with soft indicators is prone to (significant) errors.
That does not mean that a hard indicator is error-free. By definition it measures not the phenomenon itself but some proxy associated with the phenomenon.
With a hard indicator we have a precise measurement of an imprecise measure. With a soft indicator we have an imprecise measurement of an imprecise measure.
To cure (or hide) these problems, aggregates of indicators are constructed, either as sums (indexes, or formative indicators) or as averages (factors, or reflective indicators). Indexes are more popular in economics while averages/factors are more popular in psychology, sociology etc…
For example Gross Domestic Product (GDP) is an index, while (customer) satisfaction defined as some set of opinions on a product would be a factor.
Control question: what is measured with GDP?
Collection methods from most to least reliable:
administrative registers (almost all)
some obligatory reports, some sample-based
panels, obligatory reports (when the respondent is unable/unwilling to provide information or to take part at all), mostly sample-based (intentions, soft factors)
Typical collection-method description of a sample-based survey: data collected from 1st to 2nd April 2019; cross-national panel (or sample); respondents aged 18+; panel size 1020; quotas representative for sex, age and residence type
No information provided on non-response rate/non-contact rate (why?).
Example: how to measure illiteracy? A. Ask a straight question (can you read/write?). B. Ask how many books the respondent read last year; if zero = illiterate (nasty!). C. Ask for a certificate (infeasible). I wonder what the illiteracy rate of many countries would be if approach B were exercised :-)
In spite of the fact that statistical charts are now ubiquitous in the media, this topic is usually covered only marginally in most courses on statistics, probably because it is pretty hard to produce quality graphics with office software (complexity vs difficulty).
Statistical charts can be plotted for the following three purposes:
Decoration (to attract somebody's attention; a document without pictures looks dull, color pictures are better than black-and-white ones, fanciful drawings are better than simple ones, form is the king and content does not matter)
Explanation (to better explain some phenomenon to somebody. It is claimed that a picture is worth a thousand words in this context)
Exploration (looking for data patterns at the exploratory stage of data analysis)
Note: some researchers recommend using charts at the data cleaning stage of statistical analysis. I do not agree. Data cleaning can be automated and should rely neither on manual work nor on visual inspection. Using programs to check data is a more efficient and reliable procedure. It is also 100% replicable, contrary to visual inspection.
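For example, a few simple automated checks in R; I assume a data frame d with the columns teryt and hotele2017, as in the powiat dataset used later in this lecture:
sum(is.na(d$hotele2017))        # how many values are missing?
range(d$hotele2017, na.rm=T)    # any impossible (e.g. negative) values?
sum(duplicated(d$teryt))        # any duplicated unit identifiers?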
A visual-art designer, not a statistician, is the right person for the 1st purpose. I am not an art designer so I will not tell you how to prepare eye-catching pictures. I am a statistician and I will concentrate on effective graphical methods for statistical explanation/exploration. And by effective I mean that one (graphical) method is more effective than another if its quantitative information can be decoded more quickly/easily [Robbins 2005]
Some graphs are better than others:
Recommended: (ordered) dot plots, bar charts, line plots, histograms and kernel density estimates, strip charts, multipanel displays (multiple line/dot plots instead of stacked bars), scatterplots (two variables)
Not recommended: Pie charts, bubble charts, stacked bar charts,
Note: bar/line/pie charts were introduced by William Playfair in the 18th century. Dot plots were introduced by William Cleveland (1980s). Box plots were introduced by John Tukey (1970s)
More of Playfair's charts can be found via Google or in Symanzik's paper
Bed places vs nights spent (diameters proportional to each country's population)
A strip chart (strip plot) shows the distribution of data points along a numerical axis. These plots are preferable to box plots when sample sizes are small (because they preserve more information about the data).
Example: Number of hotels in powiat by region (NUTS1, 2017):
The biggest potential problem with a dot/scatterplot is overplotting: whenever one has more than a few points, points may be plotted on top of one another. This can severely distort the visual appearance of the plot (left panel)
There is no one solution to this problem, but there are some techniques that can help: use smaller dots, use semi-transparent dots (right panel), use jitter.
Jitter (a small random noise added to the data) is shown below (higher jitter in the right panel)
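A minimal ggplot2 sketch of a jittered strip chart with semi-transparent points; the column names (nuts1, hotele2017) are taken from the powiat dataset described later in this lecture:
library(ggplot2)
ggplot(d, aes(x = nuts1, y = hotele2017)) +
  geom_jitter(width = 0.15, alpha = 0.5)   # small horizontal jitter, semi-transparent dots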
Histograms show the distribution of a set of data. To draw a histogram the numbers (observations) are grouped into bins (intervals or classes). There is a trade-off between showing details and showing an overall picture. When the bin width changes, the scale on the Y-axis changes as well (more bins = fewer observations in each bin). Example: number of hotels in Poland (2017):
ggplot(d, aes(x = hotele2017)) +
geom_histogram(bins = nclass.Sturges(d$hotele2017))
Histograms with binwidth equal to 20, 10, 5 and 1 respectively:
Drawback of histogram: scale is bin (width) dependent.
Kernel density functions
ggplot(data=d) + geom_density(aes(x=hotele2017))
p1 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=0.25)
p2 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=1.0)
p3 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=2.0)
p4 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=8.0)
ggarrange(p1,p2,p3,p4)
Box plots are much better than histograms for comparing the distributions of more than one data set.
Construction of a (typical) box plot: the middle bar is the median. The top/bottom edges of the rectangle show the IQR (interquartile range, i.e. the 1st and 3rd quartiles); the fanciful bars above/below the rectangle, called whiskers (google: whiskers mustache :-)), extend 1.5 times the IQR beyond the box (or to the minimum/maximum if those values are closer). The symbols above/below the whiskers (usually open circles) are outliers (non-typical/extreme values).
Note the trick: outliers are defined not as (for example) the top/bottom 1% of values (every distribution would have outliers in such a case) but as values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR (distributions with moderate variability thus have no outliers).
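The 1.5·IQR rule can be checked in R with boxplot.stats(); the vector below is made up:
x <- c(1, 3, 4, 5, 6, 7, 8, 9, 30)   # 30 lies far away from the rest
boxplot.stats(x)$stats               # whisker end, Q1, median, Q3, whisker end: 1 4 6 8 9
boxplot.stats(x)$out                 # values flagged as outliers: 30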
Example: age of Nobel-prize winners (cf The Nobel Prize API Developer Hub)
nlf <- read.csv("nobel_laureates3.csv", sep = ';', dec = ",", header=T, na.string="NA");
ggplot(nlf, aes(x=category, y=age, fill=category)) + geom_boxplot() + ylab("years") + xlab("")
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).
Multiple histograms are too detailed (binwidth=5). It is impossible, for example, to establish which category has the youngest laureates (on average), or which has the oldest ones (economics and literature are candidates, but due to the multimodality of the literature laureates' distribution it is difficult to assess this for sure…)
Number of hotels in powiat by województwo (2017):
More jitter:
Boxplots are better:
A scatter plot (aka scatter diagram, xyplot) is the basic chart form used for two (quantitative) variables.
To see the relationship between the variables, a line can be fitted. The least squares (LS) line, which assumes a linear relationship between the variables, is fitted by minimizing the sum of squares of the residuals (a residual is the difference between a data point and the relevant line point, i.e. a point computed from the formula \(y = a + bx\), where x is the value of the x-axis variable).
(Almost) each part of Poland is attractive for tourists, but those counties which are at the seaside (north) or in the mountains (south) are special. There are 11 counties at the seaside (morze = sea) and 18 in the mountains (góry):
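The two fits reported below were obtained (more or less) as follows; this is a sketch only, assuming a data frame m which holds one group of counties at a time (seaside or mountains), with y2017 = number of hotels and tz2017 = number of foreign tourists (variable names as in the output):
fit <- lm(tz2017 ~ y2017, data = m)   # least squares line: tz2017 = a + b*y2017
summary(fit)                          # produces listings like the ones below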
##
## Call:
## lm(formula = tz2017 ~ y2017, data = m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127494 -22228 4466 20779 84551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52055 45219 -1.151 0.2793
## y2017 5839 2324 2.513 0.0332 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60310 on 9 degrees of freedom
## Multiple R-squared: 0.4123, Adjusted R-squared: 0.347
## F-statistic: 6.314 on 1 and 9 DF, p-value: 0.03316
##
## Call:
## lm(formula = tz2017 ~ y2017, data = m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26224 -5432 2165 5616 25199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11071.5 5820.8 -1.902 0.086330 .
## y2017 961.7 161.4 5.960 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13270 on 10 degrees of freedom
## Multiple R-squared: 0.7803, Adjusted R-squared: 0.7584
## F-statistic: 35.52 on 1 and 10 DF, p-value: 0.0001394
So each new hotel in the mountains would on average attract 961.7 foreign tourists, while a new hotel at the seaside would attract 5839 foreign tourists (and both numbers are statistically significant at \(\alpha=0.05\) :-))
Alternatively a loess curve can be used, which does not assume linearity, but its parameters are not interpretable.
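In ggplot2 such a curve can be added with geom_smooth (a sketch, variable names as above):
library(ggplot2)
ggplot(m, aes(x = y2017, y = tz2017)) + geom_point() +
  geom_smooth(method = "loess", se = FALSE)   # loess curve instead of a straight LS line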
A logarithmic scale makes it possible to plot values whose range is too wide for a linear scale. Base-10 logarithms ‘squeeze’ the numbers more than base-2 logarithms (log10(100) = 2 while log2(100) ≈ 6.64). Moreover, if the original scale contains multiples of 10, use log10 to get a ‘nice’ log scale; if it contains multiples of 2, use log2.
Logarithms transform an additive scale into a ‘multiplicative’ one. Example (Nobel prize again):
dA <- read.csv("nobel_laureates3.csv", sep = ';', dec = ",", header=T, na.string="NA");
nrow(dA)
## [1] 934
dS <- subset(dA, (! bornCountryCode == "" )) # by country of birth
nrow(dS) # how many
## [1] 901
Next, aggregate by bornCountryCode.
Finally plot the resulting data using various Y-axis scales (arithmetic, log2 and log10)
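A sketch of how this could be done (the exact code is not shown in the slides); laureates are counted per country with table() and the Y-axis scale is switched in ggplot2:
library(ggplot2)
nn <- as.data.frame(table(dS$bornCountryCode))          # laureates per country of birth
names(nn) <- c("country", "n")
nn <- subset(nn, n > 0)                                 # drop empty levels (log(0) is undefined)
p <- ggplot(nn, aes(x = country, y = n)) + geom_col()
p                                                       # arithmetic scale
p + scale_y_continuous(trans = scales::log2_trans())    # base-2 log scale
p + scale_y_log10()                                     # base-10 log scale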
The exact figures are as follows:
##
## AR AT AU AZ BA BD BE BG BR BY CA CH CL CN CO CR CY
## 0 4 17 10 1 2 1 9 1 1 4 19 17 2 12 2 1 1
## CZ DE DK DZ EG ES FI FR GB GH GP GR GT HR HU ID IE IL
## 6 82 12 2 6 7 5 55 100 1 1 1 2 1 9 1 5 6
## IN IR IS IT JP KE KR LC LR LT LU LV MA MG MK MM MX NG
## 8 2 1 19 26 1 2 2 2 3 2 1 1 1 1 1 3 1
## NL NO NZ PE PK PL PT RO RU SE SI SK TL TR TW UA US VE
## 18 12 3 1 3 25 2 4 26 29 1 1 2 3 1 5 269 1
## VN YE ZA ZW
## 1 1 9 1
From the best to the worst:
Position along common scale
Position along common but nonaligned scales
Length
Angle (slope)
Area
Volume
Color (hue), Color (saturation), Color (density of black)
Angle judgement is not precise. Acute angles are underestimated while obtuse angles (greater than 90°) are overestimated.
Area judgement is biased as well. It is impossible to distinguish small differences in area, while it is quite easy when the same data is plotted along a common scale.
The most accurate graphical task is positioning along a common scale.
Clear content: Reader/receiver/consumer clearly understands what is graphed: scales/labels/explanations are provided (remember what/when/where?)
Clear form: reader/receiver/consumer clearly sees what is graphed (no cluttered lines, overlapping elements, etc…)
Emphasize the data, not grids, labels or pointless arrows. The simpler the better; leave complicated designs to professionals. For example use gray, not black, ink for grid lines (cf the Excel default).
Tick marks and axis labels should be placed outward. X-axis values always increase from left to right, Y-axis values from bottom to top, never in the reverse direction. Do not overdo the number of tick marks.
When preparing a color chart, think how it will look when reproduced in black-and-white (xerox) or at half size or less (smartphone). This is particularly important in electronic publishing.
Never use more graphic features than your data set has dimensions. For univariate analysis use length or color, not both, for example. (Well, rare exceptions to this rule are allowed.)
Pseudo-3D charts for 2D data should be forbidden as well, without any exception. Virtually no one can read them.
Use a common baseline wherever possible. Use an optimum aspect ratio (banking to 45°, see below). Use logarithmic scales when the data range is huge, do not break scales and generally always include 0 on numerical axes (not 100% obligatory, however). Do not (generally) use double axes.
Prefer direct labels over a separate legend. A separate legend forces the reader to look back and forth when studying the graph. Of course, if there is no room for (long) labels, use a legend.
Multiline graphs are generally a bad idea (different scales, clutter, difficulty in assessing the difference between lines)
Do not use crappy software which does not keep the chart in proportion to the data
Edward Tufte (a renowned expert on information visualization) coined two popular rules: the (high) data-to-ink ratio and the lie factor.
Ink in this definition refers to the non-erasable ink used for the presentation of data. If the data-ink were removed from the image, the graphic would lose its content. Non-data-ink is accordingly the ink that does not carry the information but is used for scales, labels and edges.
Good graphics should include only data-ink. Non-data-ink is to be deleted wherever possible. The reason for this is to avoid drawing the viewers' attention to irrelevant elements. There is a short and excellent video clip on YouTube which illustrates this rule.
The lie factor (LF) is a ratio as well, defined as the size of the effect shown in the graphic divided by the size of the effect in the data. Preferably LF should equal 1. According to Tufte, an LF greater than 1.05 or less than 0.95 signals significant distortion. This rule is best explained with an example.
The giant guy (GG) in the middle is our ex-president. The guy next to him on the left is our current president, Duda. Next to Duda is the ex-rock star Kukiz, the dark horse of the elections. This is the (slightly modified) cover of an influential Polish weekly magazine from May 2015, shortly before the elections.
The figures are claimed to be in sync with the recent survey results (a sort of bar chart). Could you figure out from this chart the proportions of the candidates' scores? By how much does the giant guy outperform the runner-up candidate? Which candidate is supported by this influential magazine (easy :-))?
The lie-factor details:
The line from the shoes to the top of the head measures (at a certain reproduction size of course) 204 mm for GG, 134 mm for Duda and 42.5 mm for the ex-rock star. So \(204/134 \approx 1.5\) and \(204/42.5 \approx 4.8\). As \(44/29 \approx 1.5\) and \(44/9 \approx 4.8\) as well, formally the lie factor is perfect. But should one compare lengths or areas?
If one compares areas, not heights, one gets significantly different (and correct) results, namely \((204 \cdot 58)/(134 \cdot 21) = 4.20\) and \((204 \cdot 58)/(42.5 \cdot 15) \approx 18.56\). The lie factor is \(4.20/1.5 = 280\)% and \(18.56/4.8 = 387\)% respectively. Huge distortion.
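The same arithmetic in R, using the measurements and survey scores quoted above:
h <- c(GG = 204, Duda = 134, Kukiz = 42.5)   # heights (mm) measured on the cover
w <- c(GG = 58,  Duda = 21,  Kukiz = 15)     # widths (mm)
score <- c(GG = 44, Duda = 29, Kukiz = 9)    # survey scores (%)
h["GG"] / h                                  # height ratios: 1.00 1.52 4.80
(h["GG"] * w["GG"]) / (h * w)                # area ratios:   1.00 4.20 18.56
score["GG"] / score                          # ratios in the data: 1.00 1.52 4.89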
Moreover two more tricks were applied to boost GG. Can you see them?
BTW: the text in the pink frame claims: “figure ratios are consistent with the April–May survey outcome.” (But what exactly does “figure ratios” mean?)
The ratio between the width and the height of a rectangle is called its aspect ratio.
The aspect ratio describes the area that is occupied by the data in the chart. A change in aspect ratio changes the perception of the graph. The question is which aspect ratio is the best.
We can recognize change most easily if the absolute slopes are close to a 45-degree angle on the graph. It is much harder to see change if the curves are nearly horizontal/vertical. The idea behind banking (Cleveland, 1988) is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at an approximately 45-degree angle.
Setting the aspect ratio so that the average of the orientations of the line segments is 45 degrees is called “banking the average orientation to 45 degrees”.
Setting the aspect ratio so that the weighted mean of the line segments' orientations (weighted by segment length) is approximately 45 degrees is called the average weighted orientation method (banking to 45 degrees).
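A minimal sketch of the idea in R, using a simple median-absolute-slope variant of banking (the series is made up):
library(ggplot2)
set.seed(1)
x <- 1:20
y <- cumsum(rnorm(20))             # a made-up time series
slope <- diff(y) / diff(x)         # slopes of the line segments (in data units)
ratio <- 1 / median(abs(slope))    # aspect ratio making the median |slope| appear at 45 degrees
ggplot(data.frame(x, y), aes(x, y)) + geom_line() + coord_fixed(ratio = ratio)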
Exercise: assess which slope is the steepest one and which is the smallest one?
BTW: every chart presents the same data on atmospheric CO2 concentration (average for May of each year) as provided by the US Government's Earth System Research Laboratory, Global Monitoring Division (cf CO2 PPM - Trends in Atmospheric Carbon Dioxide)
I do not intend to give you a full lecture on spatial methods/analysis now. First of all, I am not an expert in this area. Second, most of the methods developed by cartographers are not used in the domain of the social sciences. But a few of them are popular, simple and pretty impressive (at least to family and friends (F&F), i.e. to non-professionals). Why not use them (to impress your F&F and/or your boss)? These methods are:
choropleth map
feature map
heat map
A choropleth map is a thematic map where geographic regions are colored, shaded, or patterned in relation to a value.
A feature map is a map augmented with the position of an object of interest marked in some way.
A heat map represents the intensity of the occurrence of objects within a dataset. A heat map uses color to represent intensity, though unlike a choropleth map it does not use geographical or geopolitical boundaries to group data. This technique requires point geometries, as you are mapping the frequency of occurrences at specific points.
One can think of a choropleth map as a kind of spatial histogram, while a feature map (heat map) is a kind of spatial dot plot (a fanciful spatial dot plot).
Google geoservices are now non-free if used not “directly” but via the API. One has to register a credit/debit card and sign some obscure license to use them. I used Google for years but stopped using it last year.
Google has shut down some cool geoservices, including Google Fusion Tables (launched 10 years ago). I was a big fan of GFT and I am greatly disappointed by the decision to shut it down.
QGIS is a full-featured, mature (2002) and powerful open source geographic information system (GIS).
It allows one to analyze and edit spatial information, in addition to composing and exporting graphical maps. QGIS supports both raster and vector layers; vector data is stored as point, line, or polygon features. Multiple formats of raster images are supported, and the software can georeference images.
QGIS supports shapefiles, PostGIS (ie the most important ones), and other formats. Web services, including Web Map Service and Web Feature Service, are also supported to allow use of data from external sources (Open Street Map for example).
QGIS integrates with other open-source GIS packages, including PostGIS, GRASS GIS, and MapServer. Plugins written in Python or C++ extend QGIS’s capabilities.
To start with QGIS simply go to www.qGIS.org, download it and install it. There is no need to learn the whole system; being (somewhat) acquainted with the Project and Layer/Add Layer menu items is enough:
Project: manages files, opening, saving, printing (maps). Every program has such a top menu item, usually called File.
Layer: allows adding, removing, coloring and laying out map layers taken from different data files. In particular it contains the Layer -> Add Layer menu item, which is the only one I use.
The CSV file PL_powiaty_2017.csv, which I compiled for this lecture, contains cross-sectional data for every Polish powiat (generally for 2017). Among other things one can find there:
teryt, wteryt, nuts1 (powiat identifiers explained above)
basic information on area and population (areaH, areakm, pop, popkm)
number of hotels (hotele2012, hotele2017)
number of high schools/graduates (wSzkoly, absolwenci)
number of companies, their aggregated income and profit from Rzeczpospolita 2000 ranking (firmyRz, przychodRz, wynikNettoRz)
powiat revenue per inhabitant (as transferred by the central government; przychodMF). This information is publicly available (distributed by the Polish Finance Ministry), contrary to, for example, GDP per powiat, which is not available. Powiat revenues are an indicator of a powiat's economic strength: poor powiats get low transfers while rich ones get high transfers…
d <- read.csv("PL_powiaty_2017.csv", sep = ';', header=T, na.string="NA");
revF <- fivenum(d$przychodMF)
revM <- mean(d$przychodMF, na.rm=T)
revD <- sd(d$przychodMF, na.rm=T)
c(revF, revM, revD)
## [1] 90.39000 150.66500 183.42000 235.77500 679.11000 202.47941 77.62899
ggplot(d, aes(x = przychodMF)) + geom_histogram(bins = nclass.Sturges(d$przychodMF))
ggplot(d, aes(x = przychodMF)) + geom_histogram(binwidth = 40) # about 10 USD (as of March 2019)
So on the average the revenue was 202.48 złoty and the relative dispersion was 38.34% (fortunately Poland did not join the Euro area, and we still use the local currency called złoty; złoty literally means “made of gold” BTW). Half of the powiats' revenues were between 150.66 PLN and 235.78 PLN (Q1/Q3), with minimum and maximum revenues of 90.39 PLN and 679.11 PLN respectively.
To understand the spatial distribution of wealth one can plot a choropleth map (using QGIS, not R):
Number of high schools
Number of hotels
population density (number of people per kilometer square)
More examples can be found at my Github account (URL will be provided on the last slide)
A World Heritage (WH) Site is a place listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO) as having special cultural or physical significance. There is an inventory of WH sites at whc.unesco.org. The list is available in various formats, including Excel, and rendered with QGIS it looks like:
Heat maps show density more clearly (or not – opinions differ):
Remember the Nobel winners by country? With heat maps one can plot them on the map:
Every year Rzeczpospolita, a nationwide daily economic and legal newspaper, compiles a list of the 2000 biggest companies (an idea similar to the Fortune 500 list). The distribution is highly skewed and concentrated, but what about the spatial distribution of big Polish companies? Feature/heat maps to the rescue…
BTW: the small picture in the middle depicts Poland during Weichselian and Würm cold period (15,000–11,700 years ago, cf Weichselian glaciation)
BTW2: land use Poland vs Ozbekiston
First, a short explanation about the subject of the analysis, i.e. the famous Castle of the Teutonic Order in Malbork, which is on the UNESCO heritage list (cf UNESCO heritage list):
Several religious military orders were formed in the Holy Land during the Crusades: Templars, Hospitallers, Teutonic Knights.
The Teutonic Knights or the Teutonic Order of the Hospital of St. Mary in Jerusalem, were known in Poland as Krzyżacy on account of the black cross they wore on their white coats.
Established in 1190 to protect German pilgrims in the Holy Land, the order was later transformed in order to fight heretics.
In 1226 the Teutonic Knights came to Poland, invited by Duke Konrad I of Mazovia to fight the annoying pagan Prussian tribes invading Poland from time to time from the north. The Teutonic Knights conquered Prussia, exterminated the locals and founded a powerful state with Malbork (Marienburg, or Mary's castle, in German) as its capital.
BTW: Kwidzyn in German is called Marienwerder (Mary's meadow), and there were a lot more places named Marien-something (as Marien is St Mary in German)
BTW2: it is about 40 km from Kwidzyn to Malbork :-)
There is a peer-reviewed research paper on the tourist traffic in the castle museum of Malbork:
The determinants of the tourist traffic in the castle’s museum of Malbork
Unfortunately, all the charts in this paper contain elementary errors. Could you identify them?
If one insists on using pie charts (an improved version):
or better, using bar/dot charts:
Piecharts are notorious for obscurity:
What about this bar chart (distribution of seats in the Polish parliament (Sejm) after the 2015 elections; the Sejm has 460 seats, so a 50% majority is 231 seats)?
Remember the dark-horse ex-rock star Kukiz? IMO his bar does not look like it equals 50 votes (minus 1). The PO bar is peculiar as well…
Not to mention the strange tilt to the left…
R is both a programming language for statistical computing and graphics and a piece of software (i.e. an application) to execute programs written in R. R was developed in the mid-90s at the University of Auckland (New Zealand).
Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines.
BTW why is it called so strangely (R)? A long time ago it was popular to use short names for computer languages (C for example). At AT&T Bell Labs (John Chambers) in the mid-70s a language oriented towards statistical computing was developed and called S (from Statistics). R is the letter before S in the alphabet.
RStudio is an environment through which to use R. In RStudio one can simultaneously write code, execute it, manage data, get help and view plots. RStudio is a commercial product distributed under a dual-license system by RStudio, Inc. A key developer at RStudio is Hadley Wickham, another brilliant New Zealander (cf Hadley Wickham).
Microsoft has recently invested heavily in R development. It bought Revolution Analytics, a key developer of R and provider of commercial versions of the system. With MS support the system is expected to gain even more popularity (for example through integration with popular MS products).
(univariate analysis)
The CSV file hotele_caloroczne_PL.csv contains data on number of all-season hotels in every county in Poland. First one has to load the dataset with the read.csv command:
d <- read.csv("hotele_caloroczne_PL.csv", sep = ';', header=T, na.string="NA")Computing measures of central tendency (with summary and/or fivenum)
summary(d)## teryt powiat hotele2012 hotele2017
## Min. : 201 bielski : 2 Min. : 0.000 Min. : 0.00
## 1st Qu.:1005 brzeski : 2 1st Qu.: 3.000 1st Qu.: 4.00
## Median :1636 grodziski : 2 Median : 5.000 Median : 7.00
## Mean :1721 krośnieński: 2 Mean : 8.776 Mean : 10.31
## 3rd Qu.:2475 nowodworski : 2 3rd Qu.: 10.000 3rd Qu.: 11.00
## Max. :3263 opolski : 2 Max. :158.000 Max. :183.00
## (Other) :368 NA's :1
fivenum(d$hotele2017)
## [1] 0 4 7 11 183
Computing mean:
mean(d$hotele2017)
## [1] 10.31053
And dispersion:
var(d$hotele2012); var(d$hotele2017)
## [1] NA
## [1] 244.8743
sd(d$hotele2012); sd(d$hotele2017)
## [1] NA
## [1] 15.64846
Second attempt (no output; the respective values are saved as the variables var12…sd17):
var12 <- var(d$hotele2012, na.rm=T); var17 <- var(d$hotele2017, na.rm=T)
sd12 <- sd(d$hotele2012, na.rm=T); sd17 <- sd(d$hotele2017, na.rm=T);
BTW:
c( mean(d$hotele2012, na.rm=T), mean(d$hotele2017, na.rm=T))
## [1] 8.775726 10.310526
Or, more formally, there were 8.78 hotels on the average in every county in Poland in 2012, while in 2017 there were 10.31 hotels.
The interquartile range (IQR) is the range from the upper (75%) quartile to the lower (25%) quartile. The IQR covers the central 50% of the observations. The IQR is a robust measure of dispersion, largely unaffected by extreme values:
c( IQR(d$hotele2012, na.rm=T), IQR(d$hotele2017, na.rm=T))
## [1] 7 7
Finally we can equally easily assess the skewness:
library(moments)
c(skewness(d$hotele2012, na.rm=T), skewness(d$hotele2017))
## [1] 5.998884 5.899827
The skewness of the distribution is substantial in both periods. Using the (modified) Pearson's formula \((\bar x - D)/\sigma\), where \(D\) denotes the mode, we obtain:
library("DescTools")
(mean(d$hotele2017) - Mode(d$hotele2017) )/ sd17
## [1] 0.4032682
Still the distribution is positively skewed, but the value of the coefficient is much smaller.
Sorry, but why use all this strange stuff at all? The most important argument I will present momentarily; it concerns the basic approach to doing statistical analysis.
This mode (or concept) is called Reproducible Research (RR in short).
Serious statistical analysis is not a one-off job. There is a value chain as well as a life cycle of statistical analysis.
Value chain means that there are distinct stages; life cycle means that the same data/models are used for years, and most statistical analyses do not start from scratch but are based on data from the past augmented with new data.
The problem is that the new data and model modifications should be in-sync with the past.
To make the problem worse, serious statistics should also be in sync with the work of others (to ease, or to make possible at all, meaningful (international) comparisons for example)
Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. (Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley)
Replicability vs Reproducibility
Hot topic: googling “reproducible research” returns about 158,000 hits
Replicability: an independent experiment targeting the same question will produce a result consistent with the original study.
Reproducibility: the ability to repeat the experiment with exactly the same outcome as originally reported (a description of the method/code/data is needed to do so).
Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today. (Donoho D. et al 2009)
Use Excel for data cleaning & descriptive statistics. (Excel handles missing data inconsistently and sometimes incorrectly; many common functions are poor or missing in Excel.)
Use SPSS/SAS/Stata in point-and-click mode to run serious statistical analyses.
Prepare report/paper: copy and paste output to Word/OpenOffice, add description.
Send to publisher (repeat 1–4 if returned for revision).
Problems
Tedious/time-wasting/costly.
Even small data/method change requires extensive recomputation effort/careful report/paper revision and update.
Error-prone: difficult to record/remember a ‘click history’.
Famous example: the Reinhart and Rogoff controversy. Countries with a very high debt-to-GDP ratio suffer from low growth. However, the study suffered from serious but easily identifiable flaws, which were discovered when the authors published the dataset they used in their analysis (cf Growth_in_a_Time_of_Debt)
Abandon spreadsheets.
Abandon point-and-click mode. Use statistical scripting languages and run program/scripts.
Benefits
Improved: reliability, transparency, automation, maintainability. Lower costs (in the long run).
Solves 1–2 but not 3–4.
Problems: Steeper learning curve. Perhaps higher costs in short run. Duplication of effort (or mess if scripts/programs are poorly documented).
Literate programming concept: code and description in one document. Create software as works of literature, by embedding source code inside descriptive text, rather than the reverse (as in most programming languages), in an order that is convenient for human readers.
A program is expressed as a web of ideas: a WEB source is tangled (turned into compilable code) and woven (turned into a document), preserving the relations and connections between the program parts. WEB is a combination of a document formatting language and a programming language.
General idea of Literate statistical programming mimics Knuth’s WEB system.
Statistical computing code is embedded inside descriptive text. The literate statistical program is woven (turned) into a report/paper by executing the code and inserting the results obtained; after data/method changes the report is simply regenerated.
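A minimal R Markdown sketch (file name and columns refer to the hotel examples earlier in this lecture); when the file is knitted (woven), the chunk and the inline code are replaced by their computed results:

---
title: "Hotels in Poland"
output: html_document
---

```{r}
d <- read.csv("hotele_caloroczne_PL.csv", sep = ";")
```

On the average there were `r round(mean(d$hotele2017, na.rm = TRUE), 2)` all-season hotels per county in 2017.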
Solves 1–4.
Reliability: easier to find/fix bugs. The results produced will not change when recomputed (in theory at least).
Efficiency: reuse allows one to avoid duplication of effort (payoff in the long run).
Transparency: increased citation rate, broader impact, improved institutional memory
Institutional memory is a collective set of facts, concepts, experiences and know-how held by a group of people.
Flexibility: When you don’t ‘point-and-click’ you gain many new analytic options.
Problems of LSP: Many incl. costs and learning curve
Tools:
Document formatting language: LaTeX (not recommended) or Markdown (or many others, e.g. Org-mode). LaTeX is a document preparation system / document markup language. Markdown is a lightweight document markup language based on e-mail text-formatting conventions: easy to write, read and publish as-is.
Program language: R
The basic idea is that instead of manually registering the changes one has made to data, documents etc., one can use software to help manage the whole process. Such software is called a Version Control System, or VCS.
A VCS not only manages content, registering each modification of it, but controls access to the content as well. Thus many individuals can work on a common project (compare this to the common scenario of mailing spreadsheets to each other – highly inefficient, to say the least).
There are highly reliable and publicly available VCS services, and GitHub is the most popular of them.
GitHub is owned by Microsoft (do not use if you boycott MS :-))
I use GitHub as an educational tool: to distribute learning content to my students and to store content they produce for me (ie projects)
The free GitHub account is public. It is OK for me. If it is not OK for you, you can buy a license for commercial account or do not use GitHub.
R/Rstudio for computing and data visualization
Github for enhancing team work
markdown for reproducible research
some other tools: QGIS for example
New practice
Introduce reproducible research approach
Use real (big and dirty) data sets.
Introduce some programming (programming or using the mouse?)
Introduce some new tools (R/Rstudio, QGIS, Github)
Learning resources
bookdown: Authoring Books and Technical Documents with R Markdown
Supplementary resources to my lecture (slides/data/R scripts etc) are available at: https://github.com/hrpunio/Z-MISC/tree/master/Erasmus/2019/Namagan
Data banks